Scientific Methods and Knowledge


STATISTICAL INFERENCE AND HYPOTHESIS TESTING

Many scientific studies seek to measure, explain, and make predictions about natural phenomena. Other studies seek to detect and measure the effects of an intervention on a system. Statistical inference provides a conceptual and computational framework for addressing the scientific questions in each setting. Estimation and hypothesis testing are broad groupings of inferential procedures. Estimation is suitable for settings in which the main goal is the assessment of the magnitude of a quantity, such as a measure of a physical constant or the rate of change in a response corresponding to a change in an explanatory variable. Hypothesis testing is suitable for settings in which scientific interest is focused on the possible effect of a natural event or intentional intervention, and a study is conducted to assess the evidence for and against this effect. In this context, hypothesis testing helps answer binary questions. For example, will a plant grow faster with fertilizer A or fertilizer B? Do children in smaller classes learn more? Does an experimental drug work better than a placebo? Several types of more specialized statistical methods are used in scientific inquiry, including methods for designing studies and methods for developing and evaluating prediction algorithms.

Because hypothesis testing has been involved in a major portion of reproducibility and replicability assessments, we consider this mode of statistical inference in some detail. However, considerations of reproducibility and replicability apply broadly to other modes and types of statistical inference. For example, the issue of drawing multiple statistical inferences from the same data is relevant for all hypothesis testing and in estimation.

Studies involving hypothesis testing typically involve many factors that can introduce variation in the results. Some of these factors are recognized, and some are unrecognized. Random assignment of subjects or test objects to one or the other of the comparison groups is one way to control for the possible influence of both unrecognized and recognized sources of variation. Random assignment may help avoid systematic differences between groups being compared, but it does not affect the variation inherent in the system (e.g., population or an intervention) under study.
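As a minimal illustration of random assignment (using hypothetical subject identifiers and arbitrary group sizes, not any particular study design), subjects can be shuffled and split into two comparison groups:

```python
import random

# Hypothetical subject identifiers; a real study would use its own roster.
subjects = [f"subject_{i:02d}" for i in range(1, 21)]

random.seed(42)  # fixed seed only so this illustration is repeatable
shuffled = random.sample(subjects, k=len(subjects))  # random order, no replacement
group_a = shuffled[:10]
group_b = shuffled[10:]

print("Group A:", group_a)
print("Group B:", group_b)
```

Because every subject is equally likely to end up in either group, both recognized and unrecognized characteristics tend to balance out across the groups on average, which is the point made in the paragraph above.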

Scientists use the term null hypothesis to describe the supposition that there is no difference between the two intervention groups or no effect of a treatment on some measured outcome (Fisher, 1935). A commonly used formulation of hypothesis testing is based on the answer to the following question: If the null hypothesis is true, what is the probability of obtaining a difference at least as large as the observed one? In general, the greater the observed difference, the smaller the probability that a difference at least as large as the observed one would be obtained when the null hypothesis is true. This probability of obtaining a difference at least as large as the observed one when the null hypothesis is true is called the “p-value.” As traditionally interpreted, if a calculated p-value is smaller than a defined threshold, the results may be considered statistically significant. A typical threshold may be p ≤ 0.05 or, more stringently, p ≤ 0.01 or p ≤ 0.005. In a statement issued in 2016, the American Statistical Association Board (Wasserstein and Lazar, 2016, p. 129) noted:

While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted. This has led to some scientific journals discouraging the use of p-values, and some scientists and statisticians recommending their abandonment, with some arguments essentially unchanged since p-values were first introduced.

More recently, it has been argued that p-values, properly calculated and understood, can be informative and useful; however, a conclusion of statistical significance based on an arbitrary threshold of likelihood (even a familiar one such as p ≤ 0.05) is unhelpful and frequently misleading (Wasserstein et al., 2019; Amrhein et al., 2019b).

Understanding what a p-value does not represent is as important as understanding what it does indicate. In particular, the p-value does not represent the probability that the null hypothesis is true. Rather, the p-value is calculated on the assumption that the null hypothesis is true. The probability that the null hypothesis is true, or that the alternative hypothesis is true, can be based on calculations informed in part by the observed results, but this is not the same as a p-value.
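To make the definition concrete, a permutation test simulates the null hypothesis of “no difference” directly by reshuffling group labels and asking how often a difference at least as large as the observed one arises by chance. The sketch below uses made-up outcome values for two hypothetical intervention groups; it illustrates the logic of a p-value rather than prescribing a method for any particular study.

```python
import random

# Made-up outcome measurements for two hypothetical intervention groups.
group_a = [5.1, 4.8, 6.0, 5.5, 5.9, 6.2, 5.4, 5.7]
group_b = [4.6, 4.9, 5.0, 4.4, 5.2, 4.7, 4.5, 5.1]

observed_diff = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))

# Under the null hypothesis the group labels are arbitrary, so we repeatedly
# reshuffle them and count how often a difference at least as large as the
# observed one arises by chance alone.
pooled = group_a + group_b
n_a = len(group_a)
random.seed(0)
n_permutations = 10_000
count_as_extreme = 0
for _ in range(n_permutations):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
    if diff >= observed_diff:
        count_as_extreme += 1

p_value = count_as_extreme / n_permutations
print(f"Observed difference: {observed_diff:.2f}, permutation p-value: {p_value:.3f}")
```

The resulting proportion is exactly the quantity described above: the probability, computed assuming the null hypothesis is true, of a difference at least as extreme as the one observed.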

In scientific research involving hypotheses about the effects of an intervention, researchers seek to avoid two types of error that can lead to non-replicability:

Type I error—a false positive or a rejection of the null hypothesis when it is correct

Type II error—a false negative or failure to reject a false null hypothesis, allowing the null hypothesis to stand when an alternative hypothesis, and not the null hypothesis, is correct

Ideally, both Type I and Type II errors would be simultaneously reduced in research. For example, increasing the statistical power of a study by increasing the number of subjects can reduce the likelihood of a Type II error for any given likelihood of Type I error. Although the increase in data that comes with higher-powered studies can help reduce both Type I and Type II errors, adding more subjects typically means more time and cost for a study.
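The effect of sample size on the Type II error rate can be illustrated by simulation. The sketch below assumes normally distributed, unit-variance outcomes and an arbitrary effect size chosen only for demonstration; it estimates power (one minus the Type II error rate) at a few sample sizes while the Type I error rate is held at 0.05.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def estimated_power(n_per_group, effect_size=0.5, alpha=0.05, n_sims=2000, seed=1):
    """Estimate power (1 minus the Type II error rate) for a two-group comparison
    by simulating many studies with normal, unit-variance outcomes.
    The effect size and alpha are illustrative choices, not recommendations."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        se = sqrt(stdev(a) ** 2 / n_per_group + stdev(b) ** 2 / n_per_group)
        z = (mean(b) - mean(a)) / se
        if abs(z) > crit:  # the simulated study "detects" the effect
            rejections += 1
    return rejections / n_sims

for n in (20, 50, 100):
    print(f"n per group = {n:3d}: estimated power ≈ {estimated_power(n):.2f}")
```

Larger samples detect the same effect more often, so the chance of a Type II error falls as the number of subjects grows, at the cost of a larger and more expensive study.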

Researchers are often forced to make tradeoffs in which reducing the likelihood of one type of error increases the likelihood of the other. For example, when p-values are deemed useful, Type I errors may be minimized by lowering the significance threshold to a more stringent level (e.g., by lowering the standard p ≤ 0.05 to p ≤ 0.005). However, this would simultaneously increase the likelihood of a Type II error. In some cases, it may be useful to define separate interpretive zones, where p-values above one significance threshold are not deemed significant, p-values below a more stringent significance threshold are deemed significant, and p-values between the two thresholds are deemed inconclusive. Alternatively, one could simply accept the calculated p-value for what it is—the probability of obtaining the observed result or one more extreme if the null hypothesis were true—and refrain from further interpreting the results as “significant” or “not significant.” The traditional reliance on a single threshold to determine significance can incentivize behaviors that work against scientific progress (see the Publication Bias section in Chapter 5).
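A rough calculation makes this tradeoff explicit. Under the same simplified assumptions as the previous sketch (a two-sided, two-sample z-test with unit-variance outcomes and an illustrative effect size and sample size), tightening the threshold from 0.05 to 0.005 lowers the Type I error rate but raises the Type II error rate:

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def approx_power(alpha, effect_size, n_per_group):
    """Approximate power of a two-sided, two-sample z-test with unit-variance
    outcomes; the effect size and sample size used below are illustrative only."""
    crit = norm.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)  # expected shift of the test statistic
    return (1 - norm.cdf(crit - shift)) + norm.cdf(-crit - shift)

n_per_group = 50
effect_size = 0.5
for alpha in (0.05, 0.01, 0.005):
    power = approx_power(alpha, effect_size, n_per_group)
    print(f"alpha = {alpha:<6}: power ≈ {power:.2f}, Type II error rate ≈ {1 - power:.2f}")
```

At a fixed sample size, each step toward a more stringent threshold buys fewer false positives at the price of more false negatives.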

Tension can arise between replicability and discovery, specifically, between the replicability and the novelty of the results. Hypotheses with low a priori probabilities are less likely to be replicated. In this vein, Wilson and Wixted (2018) illustrated how fields that are investigating potentially ground-breaking results will produce results that are less replicable, on average, than fields that are investigating highly likely, almost-established results. Indeed, a field could achieve near-perfect replicability if it limited its investigations to prosaic phenomena that were already well known. As Wilson and Wixted (2018, p. 193) state, “We can imagine pages full of findings that people are hungry after missing a meal or that people are sleepy after staying up all night,” which would not be very helpful “for advancing understanding of the world.” In the same vein, it would not be helpful for a field to focus solely on improbable, outlandish hypotheses.

The goal of science is not, and ought not to be, for all results to be replicable. Reports of non-replication of results can generate excitement as they may indicate possibly new phenomena and expansion of current knowledge. Also, some level of non-replicability is expected when scientists are studying new phenomena that are not well established. As knowledge of a system or phenomenon improves, replicability of studies of that particular system or phenomenon would be expected to increase.

Assessing the probability that a hypothesis is correct in part based on the observed results can also be approached through Bayesian analysis. This approach starts with a priori (before data observation) assumptions, known as prior probabilities, and revises them on the basis of the observed data using Bayes' theorem, sometimes described as the Bayes formula.
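For a hypothesis H and observed data D, the theorem can be written as follows (a standard statement of Bayes' theorem, not a formula specific to Appendix D); the odds form on the second line shows how the pre-experimental odds are multiplied by the strength of the evidence to give the post-experimental odds:

```latex
P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D \mid H)\,P(H) + P(D \mid \bar{H})\,P(\bar{H})}
\qquad
\frac{P(H \mid D)}{P(\bar{H} \mid D)} = \frac{P(D \mid H)}{P(D \mid \bar{H})} \times \frac{P(H)}{P(\bar{H})}
```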

Appendix D illustrates how a Bayesian approach to inference can, under certain assumptions about the data generation mechanism and the a priori likelihood of the hypothesis, use observed data to estimate the probability that a hypothesis is correct. One of the most striking lessons from Bayesian analysis is the profound effect that the pre-experimental odds have on the post-experimental odds. For example, under the assumptions shown in Appendix D, if the prior probability of an experimental hypothesis was only 1 percent and the obtained results were statistically significant at the p ≤ 0.01 level, only about one in eight such conclusions that the hypothesis was true would be correct. If the prior probability was as high as 25 percent, more than four in five such conclusions would be correct. As common sense would dictate and Bayesian analysis can quantify, it is prudent to adopt a lower level of confidence in the results of a study with a highly unexpected and surprising result than in a study for which the results were a priori more plausible (e.g., see Box 2-2).
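The arithmetic behind such statements can be sketched with a simple screening model in which “significant” results occur at a rate equal to the study's power when the hypothesis is true and at a rate equal to the significance threshold when it is false. The power value below is an assumption made only for this illustration, so the outputs will not exactly reproduce the Appendix D figures quoted above, which depend on that appendix's own assumptions.

```python
def posterior_probability(prior, alpha, power):
    """Probability that a hypothesis is true given a 'significant' result, under a
    simple model: significant results arise at rate `power` when the hypothesis is
    true and at rate `alpha` when it is false."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# Illustrative priors; alpha and power are assumed values for this sketch only.
for prior in (0.01, 0.25):
    post = posterior_probability(prior=prior, alpha=0.01, power=0.5)
    print(f"prior = {prior:.2f}: posterior probability ≈ {post:.2f}")
```

Even with these rough inputs, the pattern is the same as in Appendix D: the lower the prior probability of the hypothesis, the lower the probability that a statistically significant result reflects a true effect.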

BOX 2-2

Pre-Experimental Probability: An Example.

Highly surprising results may represent an important scientific breakthrough, even though it is likely that only a minority of them will turn out over time to be correct. In terms of the example in the previous paragraph, it may be crucial to learn which one of the eight highly unexpected (prior probability, 1 percent) results can be verified and which one of the five moderately unexpected (prior probability, 25 percent) results should be discounted.

Keeping the idea of prior probability in mind, research focused on making small advances to existing knowledge would result in a high replication rate (i.e., a high rate of successful replications) because researchers would be looking for results that are very likely correct. But doing so would have the undesirable effect of reducing the likelihood of making major new discoveries (Wilson and Wixted, 2018). Many important advances in science have resulted from a bolder approach based on more speculative hypotheses, although this path also leads to dead ends and to insights that seem promising at first but fail to survive after repeated testing.

The “safe” and “bold” approaches to science have complementary advantages. One might argue that a field has become too conservative if all attempts to replicate results are successful, but it is reasonable to expect that researchers follow up on new but uncertain discoveries with replication studies to sort out which promising results prove correct. Scientists should be cognizant of the level of uncertainty inherent in speculative hypotheses and in surprising results in any single study.


